{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "359697d5"
      },
      "source": [
        "# Vertex AI Search with Filters & Metadata\n",
        "\n",
        "<table align=\"left\">\n",
        "\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\">\n",
        "      View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/search_filters_metadata.ipynb\">\n",
        "      <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
        "      Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_filters_metadata.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>            \n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "31280da3e872"
      },
      "source": [
        "---\n",
        "\n",
        "* Author(s): [Ruchika Kharwar](https://github.com/rasalt)\n",
        "* Created: 5 Sep 2023\n",
        "\n",
        "---\n",
        "\n",
        "## Objective\n",
        "\n",
        "This notebook shows how to use [filters and metadata](https://cloud.google.com/generative-ai-app-builder/docs/filter-search-metadata) in search requests to [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/introduction).\n",
        "\n",
        "This works with unstructured apps that contain metadata. You can use metadata fields to restrict your search to a specific set of documents.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zj-AnsYaOSNq"
      },
      "source": [
        "Services used in the notebook:\n",
        "\n",
        "- ✅ Vertex AI Search for document search and retrieval"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pQbOO-Lc-2-7"
      },
      "source": [
        "## Install pre-requisites\n",
        "\n",
        "If running in Colab install the pre-requisites into the runtime. Otherwise it is assumed that the notebook is running in Vertex AI Workbench. In that case it is recommended to install the pre-requisites from a terminal using the `--user` option.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "ohPUPez8imvE"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython3.11 -m pip install --upgrade pip\u001b[0m\n",
            "Note: you may need to restart the kernel to use updated packages.\n"
          ]
        }
      ],
      "source": [
        "%pip install -q google-cloud-discoveryengine==0.11.2 --upgrade --user"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R5Xep4W9lq-Z"
      },
      "source": [
        "### Restart current runtime\n",
        "\n",
        "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "XRvKdaPDTznN"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'status': 'ok', 'restart': True}"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# Restart kernel after installs so that your environment can access the new packages\n",
        "\n",
        "import IPython\n",
        "\n",
        "app = IPython.Application.instance()\n",
        "app.kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SbmM4z7FOBpM"
      },
      "source": [
        "<div class=\"alert alert-block alert-warning\">\n",
        "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n",
        "</div>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8vEczuYo_r-g"
      },
      "source": [
        "## Authenticate\n",
        "\n",
        "If running in Colab authenticate with `google.colab.google.auth` otherwise assume that running on Vertex AI Workbench.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "loTfn0KniwB2"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth as google_auth\n",
        "\n",
        "    google_auth.authenticate_user()\n",
        "\n",
        "from google.auth import default\n",
        "\n",
        "creds, _ = default()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_ARUa9gEK74l"
      },
      "source": [
        "## Configure notebook environment\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a5106c79f63a"
      },
      "source": [
        "## Data store metadata\n",
        "\n",
        "Metadata for the data store `<DATA_STORE_ID>` looks like this:\n",
        "\n",
        "```json\n",
        "{\n",
        "  \"id\": \"1\",\n",
        "  \"structData\": {\n",
        "    \"title\": \"Document1\",\n",
        "    \"category\": [\n",
        "      \"PersonaA\"\n",
        "    ],\n",
        "    \"name\": \"Document1\"\n",
        "  },\n",
        "  \"content\": {\n",
        "    \"mimeType\": \"application/pdf\",\n",
        "    \"uri\": \"gs://<BUCKETNAME>/data/Document1\"\n",
        "  }\n",
        "}\n",
        "```\n",
        "\n",
        "```json\n",
        "{\n",
        "  \"id\": \"2\",\n",
        "  \"structData\": {\n",
        "    \"title\": \"Document2\",\n",
        "    \"category\": [\n",
        "      \"PersonaA\",\n",
        "      \"PersonaB\"\n",
        "    ],\n",
        "    \"name\": \"Document2\"\n",
        "  },\n",
        "  \"content\": {\n",
        "    \"mimeType\": \"application/pdf\",\n",
        "    \"uri\": \"gs://<BUCKETNAME>/data/Document2\"\n",
        "  }\n",
        "}\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "66ff7614"
      },
      "source": [
        "### Set the following constants to reflect your environment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "058996c2ad84"
      },
      "outputs": [],
      "source": [
        "PROJECT_ID = \"<PROJECT_ID>\"\n",
        "LOCATION = \"global\"  # Replace with your data store location\n",
        "DATA_STORE_ID = \"<DATA_STORE_ID>\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8d308ea479a0"
      },
      "source": [
        "### REST API examples"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f3235a9f98cb"
      },
      "source": [
        "The filter `name: ANY(\"Document1\")` ensures the query is against only the documents with `name` matching `Document1`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6c68c5efc8d2"
      },
      "outputs": [],
      "source": [
        "%%bash -s \"$PROJECT_ID\" \"$LOCATION\" \"$DATA_STORE_ID\"\n",
        "\n",
        "project_id=$1\n",
        "location=$2\n",
        "data_store_id=$3\n",
        "\n",
        "curl -X POST -H \"Authorization: Bearer $(gcloud auth print-access-token)\" \\\n",
        "-H \"Content-Type: application/json\" \\\n",
        "\"https://discoveryengine.googleapis.com/v1beta/projects/$project_id/locations/$location/collections/default_collection/dataStores/$data_store_id/servingConfigs/default_search:search\" \\\n",
        "-d '{\n",
        "\"query\": \"claim\",\n",
        "\"filter\": \"name: ANY(\\\"Document1\\\")\"\n",
        "}'\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "654d958918bf"
      },
      "source": [
        "The filter `category: ANY(\"PersonaB\")` ensures the query is against only the documents with `name` matching `Document1`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7faa56904083"
      },
      "outputs": [],
      "source": [
        "%%bash -s \"$PROJECT_ID\" \"$LOCATION\" \"$DATA_STORE_ID\"\n",
        "\n",
        "project_id=$1\n",
        "location=$2\n",
        "data_store_id=$3\n",
        "\n",
        "curl -X POST -H \"Authorization: Bearer $(gcloud auth print-access-token)\" \\\n",
        "-H \"Content-Type: application/json\" \\\n",
        "\"https://discoveryengine.googleapis.com/v1beta/projects/$project_id/locations/$location/collections/default_collection/dataStores/$data_store_id/servingConfigs/default_search:search\" \\\n",
        "-d '{\n",
        "\"query\": \"claims\",\n",
        "\"filter\": \"category: ANY(\\\"PersonaB\\\")\"\n",
        "}'\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3fa2d40ae437"
      },
      "source": [
        "### Python code equivalent"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "68cf48879475"
      },
      "outputs": [],
      "source": [
        "from google.api_core.client_options import ClientOptions\n",
        "from google.cloud import discoveryengine_v1beta as discoveryengine\n",
        "\n",
        "\n",
        "def search_data_store(\n",
        "    project_id: str,\n",
        "    location: str,\n",
        "    data_store_id: str,\n",
        "    search_query: str,\n",
        "    filter_str: str,\n",
        ") -> discoveryengine.SearchResponse:\n",
        "    #  For more information, refer to:\n",
        "    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store\n",
        "    client_options = (\n",
        "        ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n",
        "        if location != \"global\"\n",
        "        else None\n",
        "    )\n",
        "\n",
        "    # Create a client\n",
        "    client = discoveryengine.SearchServiceClient(client_options=client_options)\n",
        "\n",
        "    # The full resource name of the search engine serving config\n",
        "    # e.g. projects/{project_id}/locations/{location}/dataStores/{data_store_id}/servingConfigs/{serving_config_id}\n",
        "    serving_config = client.serving_config_path(\n",
        "        project=project_id,\n",
        "        location=location,\n",
        "        data_store=data_store_id,\n",
        "        serving_config=\"default_config\",\n",
        "    )\n",
        "\n",
        "    # Optional: Configuration options for search\n",
        "    # Refer to the `ContentSearchSpec` reference for all supported fields:\n",
        "    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec\n",
        "    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(\n",
        "        # For information about snippets, refer to:\n",
        "        # https://cloud.google.com/generative-ai-app-builder/docs/snippets\n",
        "        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(\n",
        "            return_snippet=True\n",
        "        ),\n",
        "        extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(\n",
        "            max_extractive_answer_count=5,\n",
        "            max_extractive_segment_count=1,\n",
        "        ),\n",
        "        # For information about search summaries, refer to:\n",
        "        # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries\n",
        "        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(\n",
        "            summary_result_count=5,\n",
        "            include_citations=True,\n",
        "            ignore_adversarial_query=False,\n",
        "            ignore_non_summary_seeking_query=False,\n",
        "        ),\n",
        "    )\n",
        "\n",
        "    # Refer to the `SearchRequest` reference for all supported fields:\n",
        "    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest\n",
        "    request = discoveryengine.SearchRequest(\n",
        "        serving_config=serving_config,\n",
        "        query=search_query,\n",
        "        filter=filter_str,\n",
        "        page_size=5,\n",
        "        content_search_spec=content_search_spec,\n",
        "        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(\n",
        "            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,\n",
        "        ),\n",
        "        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(\n",
        "            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO\n",
        "        ),\n",
        "    )\n",
        "\n",
        "    response = client.search(request)\n",
        "    return response"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f9b9d4a02d1a"
      },
      "outputs": [],
      "source": [
        "search_query = \"how to process a claim\"\n",
        "filter_str = 'name: ANY(\"Document1\")'\n",
        "\n",
        "results = search_data_store(\n",
        "    PROJECT_ID, LOCATION, DATA_STORE_ID, search_query, filter_str\n",
        ")\n",
        "\n",
        "print(f\"\\nQuestion: '{search_query}'\\n\\n\")\n",
        "print(\"Summary\" + \"-\" * 40)\n",
        "print(results.summary.summary_text)\n",
        "\n",
        "print(\"Raw Results\" + \"-\" * 40)\n",
        "print(results)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4a927c39a4fc"
      },
      "source": [
        "Here is a slightly more complex filter based on 2 metadata values"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c679d143e46d"
      },
      "outputs": [],
      "source": [
        "search_query = \"how to process a claim\"\n",
        "filter_str = 'name: ANY(\"Document1\") AND category: ANY(\"PersonaA\")'\n",
        "\n",
        "results = search_data_store(\n",
        "    PROJECT_ID, LOCATION, DATA_STORE_ID, search_query, filter_str\n",
        ")\n",
        "\n",
        "print(f\"\\nQuestion: '{search_query}'\\n\\n\")\n",
        "print(\"Summary\" + \"-\" * 40)\n",
        "print(results.summary.summary_text)\n",
        "\n",
        "print(\"Raw Results\" + \"-\" * 40)\n",
        "print(results)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "search_filters_metadata.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
